Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor!: metrics #2464

Merged
merged 10 commits into from
Jul 22, 2024
Merged

refactor!: metrics #2464

merged 10 commits into from
Jul 22, 2024

Conversation

Arqu
Copy link
Collaborator

@Arqu Arqu commented Jul 5, 2024

Description

  • Cleans up stale counters
  • Introduces unique daily node counts on relays
  • Introduces a metrics dumper so we can create dumps of metrics and also view them in doctor plot
  • Introduces new trace level counters

Breaking Changes

  • Introduces metrics_dump_path on the top level CLI parameters, which if set will make sure metrics are collected at regular intervals and written to the provided path in CSV format.

Notes & open questions

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.
  • All breaking changes documented.

@Arqu Arqu mentioned this pull request Jul 17, 2024
4 tasks
@Arqu Arqu self-assigned this Jul 19, 2024
@Arqu Arqu added this to the v0.21.0 milestone Jul 19, 2024
@Arqu Arqu marked this pull request as ready for review July 19, 2024 11:22
@@ -72,7 +72,7 @@ package = "iroh-quinn"
version = "0.10"

[features]
default = ["fs-store"]
default = ["fs-store", "metrics"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this means metrics collection is on by default in blobs now, or just that the feature is available?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworked so that dep is always available but metrics collection is not on by default.

@@ -9,7 +9,7 @@ use struct_iterable::Iterable;
pub struct Metrics {
pub pkarr_publish_update: Counter,
pub pkarr_publish_noop: Counter,
pub pkarr_publish_error: Counter,
// pub pkarr_publish_error: Counter,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stray comment, plz remove

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of this PR was for the team to claim which ones they would like to keep as they are unused. I'm going to nuke all the ones people didn't vote to keep.

Comment on lines 10 to 22
pub rebind_calls: Counter,
// pub rebind_calls: Counter,
pub re_stun_calls: Counter,
pub update_endpoints: Counter,

// Sends (data or disco)
pub send_relay_queued: Counter,
pub send_relay_error_chan: Counter,
pub send_relay_error_closed: Counter,
pub send_relay_error_queue: Counter,
// pub send_relay_queued: Counter,
// pub send_relay_error_chan: Counter,
// pub send_relay_error_closed: Counter,
// pub send_relay_error_queue: Counter,
pub send_ipv4: Counter,
pub send_ipv4_error: Counter,
// pub send_ipv4_error: Counter,
pub send_ipv6: Counter,
pub send_ipv6_error: Counter,
// pub send_ipv6_error: Counter,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove commented-out code

@@ -40,7 +40,7 @@ pub struct Metrics {
pub sent_disco_ping: Counter,
pub sent_disco_pong: Counter,
pub sent_disco_call_me_maybe: Counter,
pub recv_disco_bad_peer: Counter,
// pub recv_disco_bad_peer: Counter,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove commented-out code

@@ -49,7 +49,7 @@ pub struct Metrics {
pub recv_disco_ping: Counter,
pub recv_disco_pong: Counter,
pub recv_disco_call_me_maybe: Counter,
pub recv_disco_call_me_maybe_bad_node: Counter,
// pub recv_disco_call_me_maybe_bad_node: Counter,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

@@ -9,7 +9,7 @@ use struct_iterable::Iterable;
pub struct Metrics {
pub pkarr_publish_update: Counter,
pub pkarr_publish_noop: Counter,
pub pkarr_publish_error: Counter,
// pub pkarr_publish_error: Counter,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is obviously guessing at usage. but i'm surprised you want to remove this. i'm used to seeing most counters in at least a pair of _total and _error so you can do error ratios.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, however these are unused counters and I think those can be misleading as you expect to see something in errors yet they never increment even if there was an error.

@@ -1762,25 +1762,30 @@ impl Actor {
};

loop {
inc!(Metrics, actor_tick_main);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all these _tick_ metrics are basically trace-level metrics. is this really a good idea? We have trace-level logging for this. What do you expect these to be used for?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we used these quite a lot in the past to detect the health of tokio loops

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So 2 reasons for this:

  • In general I'd like to move away from trace logging and shift towards trace metrics. Much easier to reason with it on a chart. Though logs do have their place where explicit order is really important.
  • We actually used these in development MUCH more and we're more useful to us than "product level health" metrics which you look at when deployed.

I could follow up on this and maybe introduce a split for "trace_metrics" vs "metrics" if people find these too much.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The general idea is to use metrics in your dev workflow too.

@Arqu Arqu force-pushed the arqu/fresh_counters branch from d94a592 to 53bb3bd Compare July 22, 2024 14:41
@Arqu Arqu requested a review from dignifiedquire July 22, 2024 15:23
Arqu added a commit that referenced this pull request Jul 22, 2024
## Description

Extends several pieces:
- introduces a "metric dumper" which is just a way of saying you can
sample the internal metrics and write them to a CSV file (which should
come in handy for 3 pieces that should come down the line; 1) CI using
these to validate behavior, 2) debug dumps from 3rd parties, 3) local
debugging)
- along with the dumper the `doctor plot` has been extended to be able
to read those dumps and also just generally improved some rough edges so
it's less error prone.
- node counts are here for relays. You now have a derived metric which
simply counts unique daily node connections.


Sample usage for the metrics dumper:
`cargo run --bin iroh --all-features -- --metrics-dump-path
test.metrics.csv start`

Sample of the plotter:
`cargo run --bin iroh --all-features -- doctor plot --timeframe 30
--interval 10 --file metrics.dump.csv
magicsock_actor_tick_main_total,magicsock_actor_tick_msg_total,magicsock_actor_tick_endpoint_heartbeat_total,magicsock_actor_tick_endpoints_update_receiver_total,magicsock_actor_tick_re_stun_total`

## Breaking Changes

<!-- Optional, if there are any breaking changes document them,
including how to migrate older code. -->

## Notes & open questions

This is merging into #2464 as
part of the larger metrics refactor.

## Change checklist

- [ ] Self-review.
- [ ] Documentation updates following the [style
guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text),
if relevant.
- [ ] Tests if relevant.
- [ ] All breaking changes documented.
@dignifiedquire dignifiedquire enabled auto-merge July 22, 2024 15:55
@dignifiedquire dignifiedquire changed the title refactor: metrics refactor!: metrics Jul 22, 2024
Arqu added 9 commits July 22, 2024 18:43
## Description

Extends several pieces:
- introduces a "metric dumper" which is just a way of saying you can
sample the internal metrics and write them to a CSV file (which should
come in handy for 3 pieces that should come down the line; 1) CI using
these to validate behavior, 2) debug dumps from 3rd parties, 3) local
debugging)
- along with the dumper the `doctor plot` has been extended to be able
to read those dumps and also just generally improved some rough edges so
it's less error prone.
- node counts are here for relays. You now have a derived metric which
simply counts unique daily node connections.


Sample usage for the metrics dumper:
`cargo run --bin iroh --all-features -- --metrics-dump-path
test.metrics.csv start`

Sample of the plotter:
`cargo run --bin iroh --all-features -- doctor plot --timeframe 30
--interval 10 --file metrics.dump.csv
magicsock_actor_tick_main_total,magicsock_actor_tick_msg_total,magicsock_actor_tick_endpoint_heartbeat_total,magicsock_actor_tick_endpoints_update_receiver_total,magicsock_actor_tick_re_stun_total`

## Breaking Changes

<!-- Optional, if there are any breaking changes document them,
including how to migrate older code. -->

## Notes & open questions

This is merging into #2464 as
part of the larger metrics refactor.

## Change checklist

- [ ] Self-review.
- [ ] Documentation updates following the [style
guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text),
if relevant.
- [ ] Tests if relevant.
- [ ] All breaking changes documented.
@Arqu Arqu force-pushed the arqu/fresh_counters branch from a5db957 to f4128fe Compare July 22, 2024 16:50
@dignifiedquire dignifiedquire added this pull request to the merge queue Jul 22, 2024
Merged via the queue into main with commit 09e9746 Jul 22, 2024
25 of 26 checks passed
@Arqu Arqu deleted the arqu/fresh_counters branch July 22, 2024 18:24
matheus23 pushed a commit that referenced this pull request Nov 14, 2024
## Description

- Cleans up stale counters
- Introduces unique daily node counts on relays
- Introduces a metrics dumper so we can create dumps of metrics and also
view them in doctor plot
- Introduces new trace level counters

## Breaking Changes

- Introduces `metrics_dump_path` on the top level CLI parameters, which
if set will make sure metrics are collected at regular intervals and
written to the provided path in CSV format.

## Notes & open questions

<!-- Any notes, remarks or open questions you have to make about the PR.
-->

## Change checklist

- [ ] Self-review.
- [ ] Documentation updates following the [style
guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text),
if relevant.
- [ ] Tests if relevant.
- [ ] All breaking changes documented.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants